
[contrib] Add MiMo-V2.5-Pro (Xiaomi, 384 experts MoE, FP8 on Trn2)#150

Open
whn09 wants to merge 24 commits into aws-neuron:main from whn09:contrib/MiMo-V2.5-Pro

Conversation


whn09 commented Apr 29, 2026

Description

Adds a contrib port of XiaomiMiMo/MiMo-V2.5-Pro targeting Trn2 with the FP8 blockwise recipe already used by the MiMo-V2-Flash and MiMo-V2.5 contrib ports (moe_tp_degree=1, moe_ep_degree=64). Self-contained under contrib/models/MiMo-V2.5-Pro/; no upstream source is modified.

Model Information

Model Name: MiMo-V2.5-Pro (Xiaomi)

Model Architecture: Decoder-only MoE transformer, 70 layers, 6144 hidden, 128 Q heads / 8 KV heads, asymmetric Q-K head_dim=192 vs V head_dim=128, hybrid attention (10 full + 60 sliding-window layers), attention sink bias on SWA layers, fused qkv_proj, sigmoid router with noaux_tc, 384 routed experts (top-8), no shared expert.

Purpose: Text generation (general-purpose Chinese/English LLM).

Checklist

Required Components

  • Accuracy Test (contrib/models/MiMo-V2.5-Pro/test/integration/test_model.py)
    • Compiles + loads the model on Neuron and asserts the generated tokens are deterministic.
  • README.md with:
    • Usage Example (direct NxDI + vLLM)
    • Compatibility Matrix (Trn2 with SDK 2.29)
    • Example Checkpoints (HF link)
    • Testing Instructions (pytest ...)
  • Source Code (contrib/models/MiMo-V2.5-Pro/src/)
    • modeling_mimo_v2.py (NxDI modeling wrapper)
    • conversion_script/preprocess_mimo_v2_fp8.py (HF OCP FP8 → Neuron FP8 streaming preprocess)

Optional Components

  • Unit tests — not included; covered by the integration test.

Folder Structure

contrib/models/MiMo-V2.5-Pro/
├── README.md
├── perf_test/
│   ├── 0_setup.sh
│   ├── bench_mimo_v2.sh
│   ├── run_bench_single.sh
│   ├── sanity_check.sh
│   ├── smoke_compile_mimo_v2.py
│   ├── smoke_generate_mimo_v2.py
│   └── vllm-neuron-patch.patch
├── src/
│   ├── modeling_mimo_v2.py
│   └── conversion_script/
│       ├── preprocess_mimo_v2_fp8.py
│       ├── preprocess_mimo_v2_parallel.py
│       └── run_preprocess_parallel.sh
└── test/
    ├── __init__.py
    ├── unit/__init__.py
    └── integration/
        ├── __init__.py
        └── test_model.py

Testing

How did you test this change?

Smoke + benchmark runs on a trn2.48xlarge (Neuron SDK 2.29, PyTorch 2.9, Python 3.12):

  1. Preprocess the HF OCP-FP8 checkpoint to Neuron-FP8 (~20 min).
  2. smoke_compile_mimo_v2.py — compile the model (TP=64, moe_tp=1/moe_ep=64, BS=48, seq_len=1024). First compile ~60 min TKG + ~15 min CTE; cached compile ~1 min.
  3. smoke_generate_mimo_v2.py — 20-token generation via HuggingFaceGenerationAdapter.
  4. bench_mimo_v2.sh — vLLM serving via vllm-neuron 0.5.0 + the patch in perf_test/vllm-neuron-patch.patch. Benchmarked with vllm bench serve --dataset-name random --random-input-len 900 --random-output-len 90.

Test Results:

vLLM serving throughput on trn2.48xlarge, FP8, BS=48, TP=64 / moe_tp=1 / moe_ep=64, continuous batching:

Concurrency  Total tok/s  Output tok/s  TTFT median (ms)  TPOT median (ms)
1            47           4.3           1,392             220
16           391          35.6          2,361             422
48           606          55            7,322             752

Per-stream ITL median holds at ~220 ms across all concurrency levels; growth at higher concurrency is from continuous-batching queue pressure.

Integration test: pytest contrib/models/MiMo-V2.5-Pro/test/integration/test_model.py -v — passes locally on the DLAMI venv (requires the preprocessed checkpoint path; see README).
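
For orientation, the shape of the determinism assertion the integration test makes is sketched below. This is not the actual body of test_model.py (which also handles compile and load on Neuron); it is a minimal sketch that assumes an already-loaded model exposing an HF-style generate interface.

```python
# Minimal sketch of the determinism check; `model` / `tokenizer` are assumed to
# come from the NxDI compile/load plumbing in test_model.py, which is not shown.
import torch

def assert_deterministic(model, tokenizer, prompt="Introduce yourself in one sentence."):
    inputs = tokenizer(prompt, return_tensors="pt")
    runs = [model.generate(**inputs, max_new_tokens=20, do_sample=False) for _ in range(2)]
    # Greedy decode on the same NEFF should produce identical token ids across runs.
    assert torch.equal(runs[0], runs[1]), "greedy decode on the same NEFF should be byte-identical"
```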

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.29
  • Instance Type(s): trn2.48xlarge (128 physical NeuronCores, logical_nc_config=2 → 64 logical cores)
  • PyTorch Version: 2.9.1
  • Python Version: 3.12

Additional Information

  • HBM footprint: Pro's FP8 expert weights are ~500 GB across 64 ranks; per-rank tensors reach ~20 GB on Trn2's 24 GB HBM, so the recipe is sensitive to collective DMA ring headroom. Any recipe change that adds per-rank tensor size (e.g. BF16 attention weights) can OOM at load.
  • BS constraint: NxDI's TKG path refuses Expert Parallelism when batch_size < num_experts / top_k. For Pro that is 384 / 8 = 48, so the smallest working BS on the FP8 path is 48.
  • FP8 numerical behaviour: under the all-FP8 recipe, Pro's attention weights (abs_mean ≈ 0.00124, 4× smaller than V2.5) cause the NKI blockwise FP8 accumulator to drift over 70 layers, producing prompt-dependent gibberish. A BF16-attn recipe (MoE kept FP8, q/k/v dequanted to BF16 via src/conversion_script/repatch_qkv_bf16.py, compiled at seq_len=256 to fit HBM) restores coherent output — verified end-to-end on smoke_generate_mimo_v2.py. This narrows the root cause to the attention path, not the MoE experts. The vLLM scripts in this PR still use the all-FP8 recipe so the bench numbers are from that configuration; re-benchmarking on BF16-attn is queued. See the README Status section for the full write-up.
  • Preprocess: src/conversion_script/preprocess_mimo_v2_fp8.py streams layer-by-layer (~24 GB peak RAM, ~20 min) and writes <save_path>/model_layer{N}.safetensors plus model_extras.safetensors. A parallel variant is provided in preprocess_mimo_v2_parallel.py.
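
For readers unfamiliar with the streaming layout, a rough sketch of the layer-by-layer write pattern from the Preprocess bullet above follows. Only the safetensors API and the model_layer{N}.safetensors output naming come from this PR; the shard scan and the convert_layer transform are placeholder assumptions, not the script's actual logic.

```python
# Rough sketch of layer-by-layer streaming preprocess; `convert_layer` and the
# tensor-name filtering are placeholders, not preprocess_mimo_v2_fp8.py itself.
import os
from safetensors import safe_open
from safetensors.torch import save_file

def preprocess(hf_shard_paths, save_path, num_layers=70):
    os.makedirs(save_path, exist_ok=True)
    for layer in range(num_layers):
        prefix = f"model.layers.{layer}."
        tensors = {}
        for shard in hf_shard_paths:
            with safe_open(shard, framework="pt") as f:
                for name in f.keys():
                    if name.startswith(prefix):
                        tensors[name] = f.get_tensor(name)
        out = convert_layer(tensors)  # hypothetical: FP8 repack + BF16 q/k/v dequant
        save_file(out, os.path.join(save_path, f"model_layer{layer}.safetensors"))
        # Only one layer's tensors are resident at a time, which keeps peak RAM bounded.
```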

Related Issues

  • Related contrib ports in the same family: #137 MiMo-V2-Flash, #148 MiMo-V2.5. Pro reuses the same modeling_mimo_v2.py wrapper and preprocess pipeline; the contrib folder is self-contained to match the per-model layout of the MiMo series.

vLLM Integration

  • This model is intended for use with vLLM.
  • Documentation includes vLLM registration instructions.

The perf_test/vllm-neuron-patch.patch adds a _register_contrib_models() hook to vllm-neuron 0.5.0's neuronx_distributed_model_loader.py that:

  • Registers NeuronMiMoV2ForCausalLM into NxDI's MODEL_TYPES under mimov2flash and mimov2pro.
  • Overrides vLLM's built-in GPU stubs for MiMoV2FlashForCausalLM / MiMoV2ProForCausalLM in ModelRegistry (they otherwise block ModelConfig validation).
  • Patches AutoConfig.from_pretrained to default trust_remote_code=True so NxDI's hf_adapter.load_config can load the custom MiMoV2Config that ships with the checkpoint.

No upstream vllm-neuron code is modified — the patch lives in the contrib folder and is applied at install time by perf_test/0_setup.sh.
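
For orientation, a hedged sketch of the shape of such a hook is below. Only ModelRegistry.register_model and AutoConfig.from_pretrained are standard vLLM / transformers APIs; MODEL_TYPES and NeuronMiMoV2ForCausalLM are passed in as parameters because their import paths inside vllm-neuron / the contrib folder are not reproduced here.

```python
# Hedged sketch of what a _register_contrib_models() hook could look like; this is
# not the patch itself, and the MODEL_TYPES / model-class wiring is an assumption.
import functools
from transformers import AutoConfig
from vllm import ModelRegistry

def register_contrib_models(MODEL_TYPES, NeuronMiMoV2ForCausalLM):
    # 1. Map the HF model_type strings onto the Neuron implementation in NxDI's registry.
    MODEL_TYPES["mimov2flash"] = NeuronMiMoV2ForCausalLM
    MODEL_TYPES["mimov2pro"] = NeuronMiMoV2ForCausalLM

    # 2. Override vLLM's built-in GPU stubs so ModelConfig validation accepts the arch.
    for arch in ("MiMoV2FlashForCausalLM", "MiMoV2ProForCausalLM"):
        ModelRegistry.register_model(arch, NeuronMiMoV2ForCausalLM)

    # 3. Default trust_remote_code=True so hf_adapter.load_config can load the
    #    custom MiMoV2Config shipped with the checkpoint.
    orig = AutoConfig.from_pretrained.__func__

    @functools.wraps(orig)
    def patched(cls, *args, **kwargs):
        kwargs.setdefault("trust_remote_code", True)
        return orig(cls, *args, **kwargs)

    AutoConfig.from_pretrained = classmethod(patched)
```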


By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

🤖 Generated with Claude Code

whn09 and others added 23 commits April 28, 2026 05:52
…named)

Bootstrap contrib entry for XiaomiMiMo/MiMo-V2.5-Pro on Trn2 via NxDI.

Same starting point as the MiMo-V2-Pro port:
  src/modeling_mimo_v2.py              (was modeling_mimo_v2_pro.py)
  src/conversion_script/preprocess_mimo_v2_fp8.py
  perf_test/{smoke_compile,smoke_generate,bench}_mimo_v2.{py,sh}

Rename-only changes in this commit:
  MiMoV2Pro* identifiers -> MiMoV2*  (classes, configs, modules)
  mimo_v2_pro paths      -> mimo_v2 / mimo_v25_pro (compile dirs)
  HF repo XiaomiMiMo/MiMo-V2-Pro -> XiaomiMiMo/MiMo-V2.5-Pro
  README architecture table updated to V2.5-Pro config
    (70 layers, 6144 hidden, 128 heads, 384 experts, etc.)
  README disk footprint updated to match V2.5-Pro actual size (~962GB HF)

Not yet adapted to V2.5-specific differences — these still need work:
  - attention_chunk_size=128 (new in V2.5, not handled in V2-Pro code)
  - MoE group-limited noaux_tc (n_group, topk_group) — V2.5 config sets 1,1
    so it degenerates to plain noaux_tc; the Pro monkey-patch already matches
  - FP8 recipe verification on V2.5 weights (V2-Pro workarounds may or may
    not apply: mean-subtract router bias, split_qkv_fused interleaved layout,
    blockwise scale stride fix)

Subsequent commits will adapt each of the above after validation on Trn2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
P0 fix:
- MiMoV2InferenceConfig now stashes `dense_intermediate_size` from HF's
  `intermediate_size` BEFORE overwriting `self.intermediate_size` with the
  MoE value. MiMoV2MLP reads this explicit field instead of the brittle
  `config.intermediate_size * 8` fallback (which happened to equal 16384
  for V2.5-Pro by coincidence).
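
A minimal sketch of the ordering fix, assuming an HF config that exposes a separate MoE width field (the moe_intermediate_size name here is an assumption, not taken from the checkpoint):

```python
# Sketch only: stash the dense-MLP width before intermediate_size is overwritten.
class MiMoV2InferenceConfig:
    def __init__(self, hf_config):
        # While intermediate_size still holds the HF dense value, stash it explicitly...
        self.dense_intermediate_size = hf_config.intermediate_size
        # ...and only then overwrite it with the per-expert MoE width.
        self.intermediate_size = hf_config.moe_intermediate_size  # field name is an assumption
```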

P3 — stale V2-Pro / Flash comments updated:
- attention_value_scale comments: "(0.707 for Flash)" → "(0.612 for V2.5-Pro)"
- convert_mimo_v2_hf_to_neuron_state_dict kv heads comments: V2.5-Pro has
  num_key_value_heads=8 (same as SWA), not 4 as in V2-Pro.
- smoke_compile docstring reworded to drop "Flash BS=1 recipe" wording.
- smoke_compile default recipe changed to moe_tp=1/moe_ep=64/BS=48
  (per user request: first V2.5-Pro test uses this recipe because it
  compiles fastest; bug surface on V2-Pro under this recipe was FP8
  precision loss in expert MLP weights, which may not reproduce on V2.5).
- preprocess router bias comment: noted measured mean=70.906 std=2.4e-4
  (identical pathology to V2-Pro, mean-subtract still required).

No behavioral change to FP8 monkey-patches or qkv interleaved-group split
logic — HF reference diff confirmed V2.5-Pro ships the same interleaved
`[16Q|1K|1V]*8` FP8 qkv layout and the same noaux_tc routing (n_group=1,
topk_group=1 degenerate to plain noaux_tc).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Record what works and what doesn't on 2026-04-28:
- Compile + load succeed on Trn2 (moe_tp=1/ep=64/BS=48 recipe).
- Prefill produces coherent English but off-topic output ("100% of the
  time..." loop for an "explain transformer" prompt). Same signature as
  V2-Pro's earlier FP8 failures — per-expert weight distribution too
  narrow for FP8 e4m3 precision.
- Note: the observed token IDs 15/16/4/315/279/882 look suspiciously small
  but are just " of/ the/ time" etc. — top-frequency English BPE subwords.
  Greedy decode is correct; the logit distribution itself is wrong.
- List recipes still to try (moe_tp=16/ep=4, moe_tp=32/ep=2 etc.) and
  NxDI constraints that rule out BS=1 when moe_ep>1.

Points future debuggers at Jim Burtoft's Flash FP8 observation and the
SDK 2.28 recommendation from his Kimi PR aws-neuron#131.

No code changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Earlier wording said "Pro's expert weight std is too small for FP8
precision" in absolute terms. That's misleading — sglang on H100/H200
runs the exact same OCP FP8 checkpoint and produces correct output,
because GPU cutlass/sglang paths dequantize FP8 to BF16 before the
matmul.

The actual issue appears to be Neuron's NKI blockwise FP8 compute
kernel (_bwmm_shard_on_block_nki_call) running FP8 compute directly
on subnormal-leaning tensors. Jim Burtoft's Kimi PR aws-neuron#131 names the
Neuron SDK 2.29 blockwise kernel as producing "depressed logits with
EP=2" and recommends SDK 2.28.

Also noted: V2.5-Pro MoE expert weights are byte-identical to V2-Pro
(measured layer 1 expert 0 gate_proj stats match to 6 decimals), so
all V2-Pro FP8 workarounds remain required — not a new bug.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Parallel preprocess wrapper:
- preprocess_mimo_v2_parallel.py: multiprocessing Pool wrapper around
  preprocess_mimo_v2_fp8.process_layer. Each worker opens its own
  LazyWeightMap and processes one layer at a time. N_WORKERS default
  raised to 12 (user request: "the more the better"); 12 workers * ~25 GB
  peak per layer stays under ~300 GB RAM on a 2 TB trn2.48xl.
- run_preprocess_parallel.sh: thin shell wrapper exposing HF_MODEL_PATH,
  SAVE_PATH, TP_DEGREE, N_WORKERS env vars. Defaults to the 2_9_nxd_inference
  venv (same one used by the serial preprocess).

Wall-clock ~30 min serial → ~5-6 min at 12 workers on fresh cache.
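
A rough sketch of the Pool wrapper described above; LazyWeightMap and process_layer are the names used in this commit, but their signatures here are simplified assumptions:

```python
# Sketch of the parallel wrapper; exact process_layer / LazyWeightMap signatures
# are assumptions, only the one-layer-per-worker structure follows this commit.
import os
from multiprocessing import Pool

from preprocess_mimo_v2_fp8 import LazyWeightMap, process_layer  # contrib module

N_WORKERS = int(os.environ.get("N_WORKERS", 12))
NUM_LAYERS = 70

def _worker(layer_idx):
    # Each worker opens its own LazyWeightMap so no large state is pickled across processes.
    weights = LazyWeightMap(os.environ["HF_MODEL_PATH"])
    process_layer(weights, layer_idx, os.environ["SAVE_PATH"])
    return layer_idx

if __name__ == "__main__":
    with Pool(N_WORKERS) as pool:
        for done in pool.imap_unordered(_worker, range(NUM_LAYERS)):
            print(f"layer {done} done")
```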

README:
- Added "NVMe mount" subsection under Prerequisites. trn2.48xl DLAMI
  assembles four NVMe into RAID0 at /opt/dlami/nvme but does NOT
  remount automatically after a reboot. Document mdadm --assemble
  + mount /dev/md0 /opt/dlami/nvme before any path in the recipes resolves.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ipts

AWS trn2-llama3.1-405b-speculative FP8 tutorial ("Scenario 2, Step 2")
requires XLA_HANDLE_SPECIAL_SCALAR=1 and UNSAFE_FP8FNCAST=1 for OCP-sourced
FP8 checkpoints on Neuron. Setting them in both smoke_compile_mimo_v2.py
and smoke_generate_mimo_v2.py via os.environ.setdefault (user-level env
overrides still win).

Note: our preprocess output has 0 bytes in the IEEE-NaN-adjacent range
(byte exp=0b1111), verified on layers.1 attn q/k/v and MoE gate_up/down
in /opt/dlami/nvme/models/MiMo-V2.5-Pro-Neuron-FP8. So these flags are
theoretically optional for our pipeline, but they match the exact surface
of AWS's reference FP8 tutorial — cheap safety.

Also corrected stale docstrings: smoke_compile now says the NxDI venv
(pytorch_2_9_nxd_inference) is the target, not the vllm venv.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…r-only LM

HuggingFace tokenizer defaults to padding_side='right', which silently
corrupts batched prefill on a causal LM: the last token of each slot
becomes a pad token, and the logit used for generating the next token
is predicting "what comes after the pad", not "what comes after the
real prompt".

Observed when running a 6-prompt probe at BS=48: prompts that nearly
fill the 267-token batch dimension produced garbage output like "all
spaces" (token 220) or random short-id BPE noise. Fix: explicitly set
padding_side='left' after tokenizer load.
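
The fix itself is two lines of standard tokenizer usage; everything around it below (paths, prompts) is placeholder:

```python
# Left-pad so the last position of every batch slot is a real prompt token.
from transformers import AutoTokenizer

model_path = "/path/to/MiMo-V2.5-Pro"  # placeholder
prompts = ["Explain transformers.", "What is 1+1?"]  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
tokenizer.padding_side = "left"  # default 'right' corrupts batched prefill on a causal LM
batch = tokenizer(prompts, return_tensors="pt", padding=True)
```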

Single-prompt smoke (all slots == same prompt, so no padding triggered)
was not affected by this bug, but was producing wrong output for a
different reason (the underlying FP8 expert-MLP precision issue).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
V2.5-Pro's attention Q/K/V weights have dequantized abs_mean ~0.001-0.005,
roughly 4x smaller than V2.5 (which works). Preprocess has been patched
to rewrite the q/k/v_proj tensors in the preprocessed checkpoint as BF16
(matching how o_proj is already handled). Add q_proj/k_proj/v_proj to
modules_to_not_convert so NxDI does not try to swap them to
QuantizedColumnParallel at convert() time — they remain plain
ColumnParallelLinear with BF16 weights.

MoE expert weights (gate_up_proj, down_proj) stay FP8 blockwise;
their weights saturate the full FP8 ±240 range so quantization is
lossless there. Only the attention path goes BF16.
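
For context, per-block FP8-to-BF16 dequantization of this kind typically looks like the sketch below; the 128x128 block size and the weight_scale_inv multiply-convention follow the common HF OCP-FP8 layout and are assumptions here, not the preprocess script itself:

```python
# Hedged sketch of blockwise FP8 -> BF16 dequant (not the actual preprocess code).
import torch

def dequant_blockwise_fp8(w_fp8: torch.Tensor, scale_inv: torch.Tensor, block: int = 128) -> torch.Tensor:
    # Expand per-block scales back to elementwise, multiply in FP32, then cast to BF16.
    scale = scale_inv.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    scale = scale[: w_fp8.shape[0], : w_fp8.shape[1]]  # trim if dims are not block multiples
    return (w_fp8.to(torch.float32) * scale).to(torch.bfloat16)
```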

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous attempt (2a4c9ff) rewrote q/k/v_proj as BF16 to work around
Pro's attention weight precision (q_proj abs_mean ~0.00124, 4x smaller
than V2.5). Compile succeeded, but load failed with HBM OOM: the BF16
attention weights added ~2 GB per rank, pushing Tensors to 20.93 GB on
a 24 GB Neuron HBM and leaving no room for collective DMA rings.

Back off on the BF16-attn approach and try a different hypothesis:
the NKI blockwise matmul kernel has accumulator precision issues on
Pro's MoE expert weights (scale_mean ~5e-5 vs 2.5e-4 on V2.5). Switch
blockwise_matmul_config from use_shard_on_block_dynamic_while to
use_torch_block_wise=True, which uses a PyTorch fallback that
dequantizes each block to BF16 before matmul. Slower but more precise
in the accumulator. q/k/v_proj return to FP8 (back out of
modules_to_not_convert) so the attention weights don't blow HBM.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pro is now serveable via vllm-neuron 0.5.0 on Trn2 (TP=64, moe_ep=64,
BS=48). Output quality under the FP8 recipe is still prompt-dependent
(drift on most prompts, coherent on self-intro style), consistent with
Pro's 4-7x smaller MoE FP8 scales compared to V2.5 and the V2-Pro
symptom.

Changes:
- Revert blockwise_matmul_config back to use_shard_on_block_dynamic_while
  + PING_PONG (Flash/Kimi recipe). The use_torch_block_wise + BF16-attn
  experiments both OOM on load.
- Fix bench_mimo_v2.sh / smoke configs from BS=32 (Flash) to BS=48
  (Pro: 384/8=48), plus all accompanying text in the README.
- vLLM patch now registers both MiMoV2FlashForCausalLM and
  MiMoV2ProForCausalLM in vLLM's ModelRegistry, overriding the built-in
  GPU stubs; patch works against vllm-neuron release-0.5.0.
- Point sanity_check.sh, run_bench_single.sh, 0_setup.sh defaults at
  the Neuron-FP8 checkpoint (not BF16).
- Record measured vLLM serving throughput at c=1/16/48 in the README
  Performance section (replaces stale BF16 numbers).
- Rewrite the Status section: document the drift pattern with prompt
  examples, the recipes that were tried and failed (BF16-attn, torch
  blockwise), and the two-node BF16 experiment queued next.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
trn2.48xlarge has 16 Trainium2 chips x 8 cores = 128 physical
NeuronCores. logical_nc_config=2 halves that to 64 logical cores,
which matches tp_degree=64. Previous Prerequisites line said "32
NeuronCores" which is wrong.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…trio

Mirror the V2.5 structure so Pro has:
- start_vllm_server.sh (new): foreground launcher baking in the full
  override_neuron_config, persistent NEURON_COMPILED_ARTIFACTS path,
  and all env-var plumbing. Stays up for ad-hoc curl/sanity.
- bench_mimo_v2.sh: rewritten as a one-shot composer (start_vllm_server
  in background + wait + sanity + run_bench_single at c=1/16/48).
  Replaces the old inline-launch-with-full-JSON version (~110 lines
  shorter).
- run_bench_single.sh: default CONFIG_NAME/RESULTS_DIR brought in line
  with bench_mimo_v2.sh and the V2.5 port.

README:
- Add "Keeping a server up for ad-hoc testing" section and an
  Environment variables table (NXDI_CONTRIB_MIMO_V2_FLASH_SRC,
  NEURON_COMPILED_ARTIFACTS, BASE_COMPILE_WORK_DIR, etc.).
- Replace the ~60-line inline vllm api_server invocation with pointers
  to start_vllm_server.sh / bench_mimo_v2.sh; the README no longer
  duplicates the config that lives in the scripts.
- Fix "downloads Flash weights" text in the 0_setup.sh blurb (now
  downloads Pro Neuron-FP8 weights).
- Bench results dir default moved to
  /opt/dlami/nvme/logs/bench_results/mimo_v2_5_pro/ to align with V2.5.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
"What is 1+1?" drifts to unrelated text under the current FP8 recipe.
"Introduce yourself in one sentence." is a high-signal self-identifying
prompt that still answers coherently (e.g. "I'm MiMo, developed by
Xiaomi LLM Core Team.") and gives a sensible first-run demo.

Also drop the explicit `temperature: 0.0` from the request body: vllm-neuron
honours the compile-time on_device_sampling_config, not the request-side
temperature, so sanity output is always sampled at T=0.6. Note this in a
comment.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Root cause of the FP8 drift is narrowed to the attention path, not the
MoE experts. Pro's q/k/v weights have abs_mean ~0.00124, 4x smaller
than V2.5 (256 experts), and the NKI blockwise FP8 accumulator loses
enough precision at this magnitude to drift the logits across 70
layers. Dequantizing q/k/v to BF16 while keeping MoE experts FP8
restores coherent output on smoke_generate, e.g.:

  <think>Okay, the user is asking for a simple self-introduction
  in one sentence, with no deeper or hidden needs apparent. As MiMo,
  based on Xiaomi's self-developed large model, I need to respond in
  a friendly, positive, and helpful way that aligns with providing
  assistance ...

Changes:
- Add src/conversion_script/repatch_qkv_bf16.py (promoted from
  /opt/dlami/nvme/scripts/), now argparse-driven. Reads HF fused
  qkv_proj + weight_scale_inv, dequants per kv-head group, writes
  BF16 q/k/v into the preprocessed Neuron-FP8 checkpoint in place,
  drops scale entries from the safetensors index. ~22 min runtime.
- smoke_compile_mimo_v2.py / smoke_generate_mimo_v2.py: add
  q_proj/k_proj/v_proj to modules_to_not_convert, drop seq_len from
  1024 to 256 (BF16 q/k/v adds ~2 GB per rank; seq_len=1024 OOMed on
  load last time), switch default COMPILED_PATH to the new BF16-attn
  directory name to avoid clobbering earlier artifacts.
- README: rewrite Status to separate the all-FP8 result (drifted)
  from the BF16-attn result (coherent); document the required repatch
  step, the HBM / seq_len trade-off, and a warning that listing q/k/v
  in modules_to_not_convert without running repatch first produces
  nonsense (NxDI casts fp8 bytes to bf16 without applying the scale).
  Update Quick Start to include the repatch step. Flag that vLLM
  scripts still use the all-FP8 recipe and the bench numbers haven't
  been re-measured on BF16-attn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The separate repatch_qkv_bf16.py was a diagnostic workaround: preprocess
FP8 first, discover drift, then retro-fit BF16. Now that BF16 attn is
the confirmed recipe, fold the per-group dequant directly into the
preprocess so there is one script, one output, no "forgot to run
repatch" trap.

Changes:
- preprocess_mimo_v2_fp8.py::split_qkv_fused now returns BF16 per-proj
  tensors directly (Dict[str, Tensor] instead of Dict[str, Tuple[...]]).
  The FP8+blockwise path still unwinds the phantom-row padding, then
  dequants to BF16 in one go. BF16-source path collapses to the same
  reshape without requant.
- Add _dequant_attn_to_bf16() for the Flash-style non-fused q/k/v
  fallback path; process_layer calls it so those projections also come
  out BF16.
- No compile-time flag or branch for "all-FP8 attn" — that recipe is
  known broken for Pro (produces gibberish), preserving the branch only
  invites re-discovering the same trap.
- Delete src/conversion_script/repatch_qkv_bf16.py.
- README: drop the "Required follow-up: repatch" subsection, simplify
  the Status writeup (one recipe, one outcome), remove step 3b from
  Quick Start, clarify in "Preprocess emits BF16 q/k/v" that
  modules_to_not_convert still needs q/k/v so NxDI routes them through
  the non-quantized ColumnParallelLinear.
- smoke_compile_mimo_v2.py: tighten the inline comment on q/k/v in
  modules_to_not_convert (no more "Prerequisite: run repatch").

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously preprocess / smoke / vLLM-serving used three different venvs
depending on which stage of the port we were in; both 2_9_nxd and
inference_vllm_0_16 happen to have working NxDI + torch installs, so
everything ran but the split was noise. Pick one and stick with it.

pytorch_inference_vllm_0_16 is the right choice because:
- 0_setup.sh installs vllm-neuron (editable) there, so vllm serving
  has no alternative.
- NxDI direct calls from smoke_compile / smoke_generate also work
  there (nxdi is preinstalled by the DLAMI in both venvs).
- Keeping one venv means no confusion about which python to invoke.

Files updated: 0_setup.sh, run_bench_single.sh, smoke_compile_mimo_v2.py
and smoke_generate_mimo_v2.py docstrings, run_preprocess_parallel.sh,
README Prerequisites.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Status had a 3-way split (all-FP8 vs BF16-attn vs preprocess emits
BF16) that made sense during the diagnosis but doesn't once BF16-attn
is the only shipping recipe. Collapse it into four focused subsections:
  * Why BF16 attn + FP8 MoE
  * Cost and constraints (HBM, seq_len=256, BS>=48, EP constraints)
  * Recipes tried that did not work (all-FP8, use_torch_block_wise)
  * Next experiments queued

Performance: reframe the vLLM throughput table as a historical
all-FP8 capture kept for infra validation and order-of-magnitude
reference. The shipping recipe (BF16 attn + seq_len=256) hasn't been
re-benchmarked yet; note the expected delta (only q/k/v change, MoE
unchanged) so readers can project.

vLLM Serving note: since the shipped start_vllm_server.sh still has
seq_len=1024 and doesn't list q/k/v in modules_to_not_convert, spell
out exactly what to change if the BF16-attn checkpoint OOMs on load.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The preprocess now emits BF16 q/k/v (no .scale entries), so vllm-neuron
must route attention through the non-quantized ColumnParallelLinear.
Three required changes:

- Add q_proj/k_proj/v_proj to modules_to_not_convert. Without this,
  NxDI tries to load q_proj.scale and bails with
  "Cannot find layers.0.self_attn.q_proj.scale in state_dict".
- Drop seq_len / max_model_len / context_encoding_buckets /
  token_generation_buckets from 1024 to 256. BF16 q/k/v adds ~2 GB per
  rank and seq_len=1024 OOMs on load; seq_len=256 is the smoke-verified
  upper bound.
- Move NEURON_COMPILED_ARTIFACTS default to a new path
  (mimo_v2_5_pro_bs48_moetp1_ep64_bf16attn_seq256_vllm) so it doesn't
  collide with the old all-FP8 compile dir that's been S3-backed up.

Note for longer context: seq_len is the single biggest HBM constraint
on this recipe; raising it will require either a smaller batch, a
different EP ratio, or cross-instance sharding (see README "Next
experiments queued").

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The start_vllm_server.sh now compiles with seq_len=256 (BF16-attn HBM
constraint). Pro's default chat template prepends a ~240-token system
prompt that by itself busts the bucket, and the old bench default
(input 900, output 90) is also way over.

sanity_check.sh:
- Switch from /v1/chat/completions to /v1/completions with a
  hand-rolled <|im_start|>user... <|im_end|><|im_start|>assistant
  frame that tokenises to ~17 tokens.
- Do the HTTP POST from python (bash heredoc mangles the \n inside
  the chat template, which used to make the model emit a garbage
  first token — UTF-8 replacement char "�" at the start of every reply).
- Note in-comment that request-side temperature / top_k / top_p are
  ignored; the NEFF's on_device_sampling_config wins.

run_bench_single.sh:
- Default INPUT_LEN 900 -> 180, OUTPUT_LEN 90 -> 60 (180+60 = 240,
  fits under seq_len=256 with a small margin for random-range-ratio).
- Comment explains the seq_len=256 constraint.

bench_mimo_v2.sh is unchanged; it delegates length knobs to
run_bench_single.sh.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Switch back to /v1/chat/completions with an explicit short system
message ("You are MiMo, a helpful assistant..."). apply_chat_template
then uses our system turn instead of Pro's ~240-token default, and
the prompt comes out to ~25 tokens — well under seq_len=256.

This is simpler than the /v1/completions + manually-framed-chat
route (no shell-escape \n landmines, native OpenAI API shape) and
composes cleanly with other chat clients that assume /v1/chat.

Override via SYSTEM=... / PROMPT=... / MAX_TOKENS=... env vars.
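
For reference, a Python equivalent of the sanity request (the shipped script is shell); the port and served-model name are placeholders, and request-side sampling parameters are omitted since the NEFF's on-device sampling config wins anyway:

```python
# Minimal sketch of the /v1/chat/completions sanity request; port/model name are placeholders.
import os
import requests

payload = {
    "model": "MiMo-V2.5-Pro",  # placeholder served-model name
    "messages": [
        {"role": "system", "content": os.environ.get("SYSTEM", "You are MiMo, a helpful assistant.")},
        {"role": "user", "content": os.environ.get("PROMPT", "Introduce yourself in one sentence.")},
    ],
    "max_tokens": int(os.environ.get("MAX_TOKENS", 64)),
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=600)
print(resp.json()["choices"][0]["message"]["content"])
```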

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
vllm-neuron's own compile path — -O3, --enable-internal-neff-wrapper,
on_device_sampling baked into the NEFF, continuous batching — produces
garbled first-decode output on Pro: every reply starts with a UTF-8
replacement char and then coherent but completely off-topic text. V2.5
under the same vllm-neuron compile path works fine, so the trigger is
Pro-specific (likely SWA + attention sink bias interacting with one of
the compile / runtime options above, root cause not isolated).

The NxDI-smoke compile path (-O1, no on-device sampler, static batch,
produced by perf_test/smoke_compile_mimo_v2.py) does not hit the
problem. vllm-neuron can load that NEFF at runtime and serves
coherent chat completions with proper `<think>` traces.

As a workaround, default NEURON_COMPILED_ARTIFACTS to the smoke
compile dir. Users can still override the env var to point at a
vllm-neuron-compiled NEFF for testing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…LM bug

seq_len=512 under the BF16-attn recipe was verified end-to-end (compile
+ shard + load + 5x deterministic greedy generate) via smoke. HBM fits;
seq_len=1024 still OOMs.

Also documents the vllm-neuron "first request coherent, subsequent
requests garbled" bug (tracked upstream at vllm-project/vllm-neuron#31).
Every configuration knob we tried (all-FP8 attn, BF16 attn at 256 or 512,
CB on/off, on-device sampling on/off, -O3 -> -O1) reproduced the same
symptom on Pro but not on V2.5; the same NEFF serves 5 successive
greedy generates byte-identically under smoke_generate_mimo_v2.py, so
the bug is in vllm-neuron's runtime, not the NEFF.

README changes:
- Status opener now says the smoke path is verified and the vLLM
  serving path is blocked on issue aws-neuron#31.
- Bump seq_len=256 references to seq_len=512 in HBM/constraints,
  Usage example, and the MoENeuronConfig code block.
- Rewrite the vLLM "Note" callout to point at issue aws-neuron#31 as the
  single source of truth for the broken state, drop the obsolete
  "drop to 256" recovery hints.

Script changes:
- smoke_compile_mimo_v2.py: SEQ_LEN default 256 -> 512; COMPILED_PATH
  suffix seq256 -> seq512. Comment rewritten.
- smoke_generate_mimo_v2.py: matching SEQ_LEN and COMPILED_PATH
  default changes so a bare `python smoke_generate_mimo_v2.py`
  picks up the seq_len=512 NEFF.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>